Steve Elston
09/18/2023
Probability theory is the basis of statistics, machine learning, and much AI
An understanding of probability theory is an important foundation to understand these methods
In this lesson we will review some basic concepts
Many texts provide comprehensive introductions to probability theory
First probability textbook Credit, Wikipedia commons
Probability distributions are models for uncertainty of random variables
\[X(\omega) \rightarrow \mathbb{R}\]
Example: The mapping can be a count
Example: The function maps \(\omega\) to a real number, \(\mathbb{R}\)
This concept appears abstract at first glance, but is fundamental
to the theory of probability
We will see many examples in this course
For discrete distributions, we can speak of a set of events within the sample space of all possible events
\[0 \le P(A) \le 1 \]
\[P(S) = \sum_{a_i \in A}P(a_i) = 1 \]
\[P(A\ \cup B) = P(A) + P(B)\\ if\ A \perp B\]
From these three axioms we can draw some useful conclusions
Events which cannot occur have probability 0
Events that must occur have probability 1
Events must have a probability mass function between 0 and 1
What value we should expect to find when we sample a random variable?
This is the expected value or simply the expectation
For \(n\) samples, \(\mathbf{X} = x_1, x_2, \ldots, x_n\), of a random variable probability mass function \(p(x_i)\) the expected value is:
\[\mathrm{E}[\mathbf{X}] = \sum_{i=1}^n x_i\ p(x_i)\]
How can we interpret expectation?
Expectation is a probability weighted sum of the sample of the random variable \(\mathbf{X}\)
By the second axiom of probability the weights must sum to 1.0
Useful properties of expectation
The relationship is linear in probability
The expectation of the sum of two random variables, \(X\) and \(Y\), is the sum of the expectations:
\[\mathrm{E}[\mathbf{X + Y}] = \mathrm{E}[\mathbf{X}] + \mathrm{E}[\mathbf{Y}]\]
\[\mathrm{E}[\mathbf{a\ X + b}] = a\ \mathrm{E}[\mathbf{X}] + b\]
Axioms of probability for continuous probability density function, \(f(x)\)
\[0 \le \int_{x_1}^{x_2} f(x) dx\ \le 1\]
Note: if \(x_1 = x_2\) the integral is 0
\[\int_{lower}^{upper} f(x) dx = 1\]
Note: many distributions lower = \(0\) or \(-\infty\) and upper = \(\infty\)
\[P(A\ \cup B) = P(A) + P(B)\ \\ if\ A \perp B\]
Expected value with PDF \(f(x)\), over the interval, \(\{ a, b \}\):
\[\mathrm{E}[\mathbf{X}] = \int_{a}^b x\ f(x)\ dx\]
Bernoulli distributions model the results of a single trial or single realization with a binary outcome
\[\begin{align} P(x\ |\ p) &= \bigg\{ \begin{matrix} p \rightarrow x = 1\\ (1 - p) \rightarrow x = 0 \end{matrix}\\ or\\ P(x\ |\ p) &= p^x(1 - p)^{(1-x)}\ x, \in {0, 1} \end{align}\]
Model the number of successful outcomes, \(k\), in \(N\) trials with the Binomial distribution
Binomial distribution is product of multiple Bernoulli trials:
\[P(k\ |\ N, p) = \binom{N}{k} p^k(1 - p)^{(N-k)}\]
Product of Bernoulli trials is normalized by the Binomial coefficient
The expected number of success \(k\) in \(N\) trials can be computed:
\[k = p\ N\]
Many real-world cases have many possible outcomes
In these cases need a probability distribution for multiple outcomes
Categorical distribution models multiple outcomes
Categorical Distribution is the multiple-outcome extension of the Bernoulli distribution, and is sometimes call the Multinoulli distribution.
Sample space of \(k\) possible outcomes, \(\mathcal{X} = (e_1,e_2, \ldots, e_k)\).
For each trial, there can only be one outcome
For outcome \(i\) we can encode the results as:
\[\mathbf{e_i} = (0, 0, \ldots, 1, \ldots, 0)\]
Only \(1\) value \(e_i\) has a value of \(1\); one hot encoding
For a single trial the probabilities of the \(k\) possible outcomes are expressed:
\[\begin{align} \Pi &= (\pi_1, \pi_2, \ldots, \pi_k) \\ \\ with\ \sum_{i}\pi_i &= 1 \end{align}\]
And consequently, we can write the simple probability mass function as:
\[f(x_i| \Pi) = \pi_i\]
For a series \(N\) of trials we can estimate each of the probabilities of the possible outcomes, \((\pi_1, \pi_2, \ldots, \pi_k)\):
\[\pi_i = \frac{\#\ e_i}{N}\]
Where \(\#\ e_i\) is the count of outcome \(e_i\).
\[\#\ e_i = \pi_i\ N\]
For the case of \(k=3\) you can visualize the possible outcomes of a single Categorical trial
Each discrete outcome must fall at one of the corners of a simplex
The probabilities of of each outcome is \((\pi_1, \pi_2, \pi_3)\).
Simplex for \(Mult_3\)
Poisson distribution models the probability, \(P\), of x arrivals within the time period
Poisson process is an example of a point process
The average number of arrivals of the Poisson process is referred to as the intensity of the process
Write the Poisson distribution in terms of the average arrival rate, \(\lambda\) as:
\[ P(x\ |\ \lambda) = \frac{\lambda^x}{x!} \exp^{-\lambda} \]
The mean and variance of the Poisson distribution are both equal to the parameter \(\lambda\), or:
\[\begin{align} Mean &= \lambda\\ Variance &= \lambda \end{align}\]
Poisson distribution models the probability, \(P\), of x arrivals within the time period
Poisson distribution for several arrival rates
Uniform distribution has flat PDF between limits \(\{ a, b \}\)
Uniform distributions are fundamental to random sampling of data and in simulation
Transformations of the Uniform distribution are typically used to generate realizations of other distributions in computational statistics.
Write the probability of the Uniform distribution as:
\[ P(x\ | \{a,b \}) = \Bigg\{ \begin{matrix} \frac{1}{(b - a)}\ if\ a \le x \le b \\ else\ 0 \end{matrix} \]
Uniform distribution has the following properties:
\[\begin{align} Mean &= \frac{(a + b)}{2}\\ Variance &= \frac{1}{12}(b - a)^2 \end{align}\]
The expectation of a uniform distribution on the interval \(\{ a,b \}\) is easy to work out:
\[\begin{align} \mathrm{E}_{a,b}(\mathbf{X}) &= \int_a^b x\ p(x)\ dx \\ &= \int_a^b x\ dx \\ &= \frac{x^2}{2}\ \big\rvert_a^b \\ &= \frac{a+b}{2} \end{align}\]
Which is just the mean.
The Normal distribution or Gaussian distribution is one of the most widely used probability distributions
The distribution of mean estimates of observations of a random variable drawn from any distribution converge to a Normal distribution by the central limit theorem (CLT)
Many physical processes produce Normal measurement values
Normal distribution has tractable mathematical properties
For a univariate Normal distribution we can write the density function as:
\[P(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp{\frac{-(x - \mu)^2}{2 \sigma^2}}\]
The parameters can be interpreted as:
\[\begin{align} \mu &= location\ parameter = mean \\ \sigma &= scale = standard\ deviation \\ \sigma^2 &= Variance \end{align}\]
Normal density for several paramter values
Many practical applications have an \(n\)-dimensional parameter vector in \(\mathbb{R}^n\), requiring multivariate distributions
\[f(\vec{\mathbf{x}}) = \frac{1}{{\sqrt{(2 \pi)^n |\mathbf{\Sigma}|}}}exp \big(\frac{1}{2} (\vec{\mathbf{x}} - \vec{\mathbf{\mu}})^T \mathbf{\Sigma} (\vec{\mathbf{x}} - \vec{\mathbf{\mu}})\big)\]
\(|\mathbf{\Sigma}|\) is the determinant of the covariance matrix.
Along the diagonal the values are the \(n\) variances of each dimension, \(\sigma_{i,i}\)
Off-diagonal terms describe the dependency between the \(n\) dimensions of the distribution.
We can write the covariance matrix:
\[ \mathbf{\Sigma} = \begin{bmatrix} \sigma_{1,1} & \sigma_{1,2} & \ldots & \sigma_{1,n} \\ \sigma_{2,1} & \sigma_{2,2} & \ldots & \sigma_{2,n} \\ \vdots & \vdots & \vdots & \vdots \\ \sigma_{n,1} & \sigma_{n,2} & \ldots & \sigma_{n,n} \\ \end{bmatrix} \]
For a Normally distributed n-dimensional multivariate random variable, \(\sigma_{i,j}\) computed from the sample, \(\mathbf{X}\):
\[\begin{align} \sigma_{i,j} &= \mathrm{E} \big[ (\vec{x}_i - \mathrm{E}[\vec{x}_i]) \cdot (\vec{x}_j - \mathrm{E}[\vec{x}_j]) \big] \\ &= \mathrm{E} \big[ (\vec{x}_i - \bar{x}_i) \cdot (\vec{x}_j - \bar{x}_j) \big] \\ &= \frac{1}{k}(\vec{x}_i - \bar{x}_i) \cdot (\vec{x}_j - \bar{x}_j) \end{align}\]
Where \(\cdot\) is the inner product operator and \(\bar{x_i}\) is the mean of \(\vec{x_i}\).
2-dimensional Normal with \(\mu = [0,0]\) and \(\sigma = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{bmatrix}\)
2-dimensional Normal with \(\mu = [0,0]\) and \(\sigma = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 0.5 \end{bmatrix}\)
2-dimensional Normal with \(\mu = [0,0]\) and \(\sigma = \begin{bmatrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{bmatrix}\)
2-dimensional Normal with \(\mu = [0,0]\) and \(\sigma = \begin{bmatrix} 1.0 & -0.5 \\ -0.5 & 1.0 \end{bmatrix}\)
2-dimensional Normal with \(\mu = [0,0]\) and \(\sigma = \begin{bmatrix} 1.0 & 0.9 \\ 0.9 & 1.0 \end{bmatrix}\)
Log Normal distribution is defined for continuous random variables in
the range \(0 < x \le \infty\)
- Examples price, weight, length, and volume
\[P(x) = \frac{1}{x} \frac{1}{\sigma \sqrt{2 \pi}} \exp{\frac{-(log(x) - \mu)^2}{2 \sigma^2}}\]
Log Normal and log transformed example
Student t-distribution, or simply the t-distribution, is of importance in statistics since it is the distribution of the difference of means of two Normally distributed random variables
t-distribution has one parameter, the degrees of freedom, denoted as \(\nu\)
The PDF of the t-distribution is a rather complex looking result:
\[ P(x\ |\ \nu) = \frac{\Gamma(\frac{\nu + 1}{2})}{\sqrt{\nu \pi} \Gamma(\frac{\nu}{2})} \bigg(1 + \frac{x^2}{\nu} \bigg)^{- \frac{\nu + 1}{2}}\\ where\\ \Gamma(x) = Gamma\ function \]
Dispursion of student-t distribution determined by DOF, \(\nu\)
- Low DOF has heavy tails compared to Normal
- Student-t \(\rightarrow\) standard
Normal as \(DOF \rightarrow
\infty\)
Heat map of
Gamma family of distributions includes several members which are important in statistics
Gamma distributions are a two-parameter exponential family
PDF is defined in the range \(0 \le x \le \infty\)
Gamma family are used in problems, ranging from measurements of physical systems to hypothesis testing
Gamma family can be parameterized in several ways; we will use:
- A shape parameter, \(\nu\), the
degrees of freedom
- A scale parameter, \(\sigma\)
\[ Gam(\nu,\sigma)=\frac{x^{\nu-1}\ e^{-x/\sigma}}{\sigma^\nu\ \Gamma(\nu)}\\ where\\ x \ge 0,\ \nu > 0,\ \sigma > 0\\ and\\ \Gamma(\nu) = Gamma\ function \]
Two useful special cases of the Gamma distribution are:
\(Gam(1,1/\lambda)\) is the
exponential distribution with decay constant \(\lambda\), and PDF:
\[exp(\lambda) = \lambda e^{-\lambda
x}\]
\(Gam(\nu/2,2) = \chi^2_\nu\) is the Chi-squared distribution with \(\nu\) degrees of freedom The \(\chi^2_\nu\) distribution has many uses in statistics.
The \(\chi^2\) distribution is used to construct parametric hypothesis tests of differences in counts between groups and also:
Constructing tests for the significance of fits of observed values to probability distributions.
The likelihood ratio test for the significance of differences between nested models.
Computing confidence intervals for empirical (as opposed to theoretical) variance estimates of observed values.
The \(\chi^2\) distribution is a parametric distribution with a single parameter, the degrees of freedom, k = number of possible outcomes - 1.
\[Q = \sum_{i=1}^n Z_i^2\]
\[Q = \chi^2_k = \chi^2(k)\]
The shape of the \(\chi^2\) distribution changes character with the DoF: For \(k = 1\ or\ 2\) the \(\chi^2\) distribution has an exponential decay with the maximum value at \(x=0\)
Heat map of
The shape of the \(\chi^2\) distribution changes character with the DoF: For a middle range of DoF values the density starts at 0 and rises to a maximum or mode and then decay back toward 0
Heat map of
The shape of the \(\chi^2\) distribution changes character with the DoF: For large values of DoF the \(\chi^2\) distribution converges toward a normal distribution with location parameter DoF
Heat map of
Odds are the ratio of the number of ways an event occurs to the number of ways it does not occur
Can say that odds are the count of events in favor of an event vs. the count against the event
Example: Flip a fair coin, odds of getting heads are \(1:1\) (1 in 1)
Example: Roll a single fair die your odds of rolling a 6 are \(1:5\) (1 in 5), or 0.2
What is the relationship between odds and probability of an event?
\[P(A) = \frac{A}{S} = \frac{A}{A + (S - A)} = \frac{A}{A + B} = \frac{count\ in\ favor}{count\ in\ favor\ + count\ not\ in\ favor}\ \\ which\ implies\\ odds = A:(S-A)\]
\[P(H) = \frac{1}{1 + 1} = \frac{1}{2}\]
\[0 < P(A) \le 1 \] \[P(S) = \sum_{a_i \in A}P(a_i) = 1 \] \[P(A\ \cup B) = P(A) + P(B)\\ if\ A \perp B\]
\[\mathrm{E}[\mathbf{X}] = \sum_{i=1}^n x_i\ p(x_i)\]
\[f(x_i| \Pi) = \pi_i\]